Explore the New MMP Data Access

A Healthy Waters Partnership Analysis

This script has been written to demonstrate how members of the Northern Three partnerships can access AIMS data using the new data catalogue.
Author

Adam Shand

Published

July 18, 2025

1 Introduction

Recently (early 2025) AIMS underwent the process of modernising their data catalogue to meet CF 1.8 conventions. The objective was to deliver datasets in NetCDF format on the AODN THREDDS server. The NetCDF format allows AIMS to store large amounts of data in a space-efficient, accessible, and content-rich way, while also allowing much better control over the associated metadata to do things such as apply data quality flags.

Note

What is AODN THREDDS??

  • AODN = Australian Ocean Data Network, check them out here
  • THREDDS = Thematic Real-time Environmental Distributed Data Services, more info here

The THREDDS Data Server (TDS) is a web server that provides metadata and data access for scientific datasets, using a variety of remote data access protocols. The AODN uses TDS to distribute its data.

The available remote data access protocols include OPeNDAP, OGC WCS, OGC WMS, and HTTP. We use OPeNDAP, but this is not of particular importance.

The modernisation of AIMS’ catalogue has had 6 main impacts:

  1. Because data is stored on the THREDDS server, it is expected that technical staff “help themselves” to the data. This requires substantial technical knowledge and raises the barrier to entry for RRC technical staff.
  2. Because data is written as a .nc (NetCDF) file, even if users download the data it is no longer in a familiar format that RRC technical staff can use without training.
  3. Due to the increased data storage efficiency, data are now stored in 10-minute intervals. Data must be aggregated to “daily” values by RRC technical staff in-house.
  4. With the introduction of a “time” variable, data entries are now recorded in UTC (Coordinated Universal Time), as is standard for publicly distributed datasets. Times must be offset by +10 hours to bring data to AEST by RRC technical staff in-house.
  5. Due to increased control over metadata, the data now carry several quality flags. (Note that flags 1 and 2 are recommended):
    • 0: No_QC_performed
    • 1: Good_data
    • 2: Probably_good_data
    • 3: Bad_data_that_are_potentially_correctable
    • 4: Bad_data
    • 5: Value_changed
  6. Because data is stored on the THREDDS server, it can be accessed via API.
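As a quick reference, these flag codes can be encoded as a small lookup vector in R (a minimal sketch; the labels are taken directly from the list above):

```r
#named lookup vector mapping each quality flag code to its meaning
qc_flag_labels <- c("0" = "No_QC_performed",
                    "1" = "Good_data",
                    "2" = "Probably_good_data",
                    "3" = "Bad_data_that_are_potentially_correctable",
                    "4" = "Bad_data",
                    "5" = "Value_changed")

#translate a vector of raw flag codes into readable labels
qc_flag_labels[as.character(c(1, 2, 4))]
```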

To summarise, although the modernisation of AIMS’ catalogue might initially seem to result in additional work for the RRC technical staff, it will actually reduce overall effort. The update to using the AODN THREDDS servers now allows us to automate the data retrieval, manipulation, and download processes. The following code will demonstrate these processes and explore the datasets in depth:

2 Script Set Up

Only basic script setup is needed; below, a series of packages are loaded.

#install pacman if it is not already present, then use it to load and install (if required) all other packages
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(ncdf4, stars, xml2, stringr, glue, here, lubridate, ggplot2, tidyverse)

3 Accessing Data

As noted above, data are now stored on the AODN THREDDS server. The benefit of this storage method is that we can access the data remotely.

It is not necessary to understand the inner workings of the data storage method. However:

  • You can find information about THREDDS storage generally here
  • You can “see” the data catalogue that we are going to remotely connect to here
  • You can learn about the OPeNDAP extraction method here

Again, it is not necessary to know these things in detail unless you become the maintainer of this workflow. All data access processes should happen automatically.

3.1 Finding the Data

Step 1 in remotely accessing data is knowing where the data you are trying to access is located. Earlier I said you can “see” the data catalogue that we are going to remotely connect to at this url: https://thredds.aodn.org.au/thredds/catalog/AIMS/catalog.html. You can open this url and explore the folders like they were folders on your computer:

  • Click on “Marine_Monitoring_Program/”
  • Click on “FLNTU_timeseries/”
  • Pick a year
  • These are each of the timeseries datasets that we are going to be remotely accessing, pick any one you want. (Note: about halfway through the filename is the name of the logger, e.g. “BUR1”).
  • This page provides basic metadata about the dataset, plus three ways to access the data, and three ways to view the data.
  • We are going to be using the OPeNDAP protocol, and thus we will click on the OPeNDAP access method
  • On this final page, the url where the data is located is shown at the very top. This url is copied into our code chunk below. Note that in the final version you will not have to manually retrieve the url like this.
Note

Each logger’s url is almost identical, with only the logger name and deployment dates changing. Further, the url is fixed and will not change. This means we can automate the construction of a range of urls very quickly and easily.

#here is the url to the logger data
input_url <- "https://thredds.aodn.org.au/thredds/dodsC/AIMS/Marine_Monitoring_Program/FLNTU_timeseries/2024/AIMS_MMP-WQ_KUZ_20240217Z_BUR1_FV01_timeSeries_FLNTU.nc"

#here is a second url to a different logger, note how the url is almost identical
secondary_url <- "https://thredds.aodn.org.au/thredds/dodsC/AIMS/Marine_Monitoring_Program/FLNTU_timeseries/2024/AIMS_MMP-WQ_KUZ_20240211Z_FTZ6_FV01_timeSeries_FLNTU.nc"

The urls that we are going to use are composed of multiple sections; they will all contain:

  • The Server:
    • https://thredds.aodn.org.au/
  • The Service:
    • thredds/dodsC/
  • The Dataset:
    • AIMS/Marine_Monitoring_Program/FLNTU_timeseries/2024/AIMS_MMP-WQ_KUZ_20240211Z_FTZ6_FV01_timeSeries_FLNTU.nc

Both the server and the service are static; they will not change and we do not need to mess with them. However, as you might suspect, the dataset section of the url will change depending on the dataset we are trying to access. Specifically, the deployment date and the logger name will change (in this case “20240211Z” and “FTZ6”).

These two aspects of the url will vary because:

  • the deployment dates vary year on year and between loggers (e.g. the loggers are not always deployed on the 1st of July each year, nor are they all deployed at the same time each year - they could be a few days apart), and
  • the logger we want to access will change, and therefore the logger name will change.
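Putting these pieces together, a full url can be assembled from its static and varying parts with glue (a sketch only; the date and logger name below are example values copied from the secondary url above):

```r
library(glue)

#the static sections of the url
server  <- "https://thredds.aodn.org.au/"
service <- "thredds/dodsC/"

#the varying sections (example values only)
target_year <- 2024
target_date <- "20240211"
target_name <- "FTZ6"

#assemble the dataset section, then the full url
dataset  <- glue("AIMS/Marine_Monitoring_Program/FLNTU_timeseries/{target_year}/",
                 "AIMS_MMP-WQ_KUZ_{target_date}Z_{target_name}_FV01_timeSeries_FLNTU.nc")
full_url <- glue("{server}{service}{dataset}")
```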

Changing the name of the logger in the url is fairly straightforward. If you want the BUR1 logger, just type “BUR1”. If you want the BUR2 logger, type “BUR2”. Each logger maintains the same name for the entirety of its deployment life, so you never have to worry about tracking name changes. However, knowing the deployment dates of each logger is tricky and introduces a conundrum in automated url creation. This is because there is no way to predict the date that is contained in the url without first opening the url to see the date range. To resolve this we could either:

  • Manually open the catalogue like we did above to get the dates, or
  • Open the catalogue as an xml document and “scrape out” the dates associated with each logger

Since this is meant to be exploring the automated data access method, of course we are going to pick the scrape method. The steps to conduct this are covered in the following code chunk:

#set the target year variable (loggers are organised by year)
target_year <- 2024

#create a url to the same catalogue we manually navigated to earlier, but change the type to .xml. Note that the year changes depending on the variable above
catalogue_url <- glue("https://thredds.aodn.org.au/thredds/catalog/AIMS/Marine_Monitoring_Program/FLNTU_timeseries/{target_year}/catalog.xml")

#open the url as an object in R
catalogue <- read_html(catalogue_url)

#pull out the dataset variable from the raw xml
nc_files <- xml_find_all(catalogue, ".//dataset")

#pull out the id from this object (which is the name of each of the logger datasets)
file_names <- xml_attr(nc_files, "id")

#create a vector of logger names
logger_names <- str_extract_all(file_names, "_.{3,5}_(?=FV01)")
logger_names <- str_remove_all(unlist(logger_names), "_")

#create a vector of logger deployment dates
logger_dates <- str_extract_all(file_names, "\\d{8}(?=Z)")
logger_dates <- str_remove_all(unlist(logger_dates), "_")

#combine the two into a list, associating the logger name with its deployment date
all_loggers_metadata <- setNames(as.list(logger_dates), logger_names)

The result of this scrape produces the following list:

#view the list in a reader friendly format
format(all_loggers_metadata)
      FTZ1       FTZ2       FTZ6       WHI6       WHI4       WHI5      BUR13 
"20240210" "20240210" "20240211" "20240212" "20240213" "20240213" "20240214" 
      WHI1       BUR2       BUR4       BUR1      TUL10       TUL3        RM7 
"20240214" "20240216" "20240216" "20240217" "20240217" "20240218" "20240219" 
      RM10        RM1        RM8 
"20240220" "20240220" "20240220" 

We could then manually pick from this list, or automate a selection process to target a specific logger. Either way, we confirm we have the correct url by running the nc_open() function on the completed url:
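As a sketch of the targeted-selection approach (the list contents here are stand-in example values mirroring the scrape output above):

```r
#a stand-in for the scraped all_loggers_metadata list built above
all_loggers_metadata <- list(FTZ6 = "20240211", BUR1 = "20240217")

#pick a logger by name and look up its deployment date
target_log_name <- "BUR1"
target_log_date <- all_loggers_metadata[[target_log_name]]

target_log_date
```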

#rotate through example url creations
for (i in seq_along(all_loggers_metadata)){

  #set the logger name:
  target_log_name <- names(all_loggers_metadata)[i]

  #set the logger date:
  target_log_date <- all_loggers_metadata[[i]]

  #build the completed url
  completed_url <- glue("https://thredds.aodn.org.au/thredds/dodsC/AIMS/Marine_Monitoring_Program/FLNTU_timeseries/2024/AIMS_MMP-WQ_KUZ_{target_log_date}Z_{target_log_name}_FV01_timeSeries_FLNTU.nc")

  #check if the url works; if it does this returns TRUE, otherwise the error is caught and FALSE is returned
  result <- tryCatch(
    {
      #open the url
      nc <- nc_open(completed_url)
      
      #if it works, close the connection and return TRUE
      nc_close(nc)
      TRUE
    },
    error = function(e) FALSE
  )
  
  #if the result was true, the url exists, if not, the url does not exist
  if (result){
    message("The ", target_log_name, " url exists.")
  } else {
    message("The ", target_log_name, " url does not exist.")
  }

}

Url creation is complete!

3.2 Opening the Data

So now we have confirmed the urls are legitimate, and we are assuming that these urls are going to the right place (i.e. the data we want). How do we actually get the data on to our computer? This is achieved in three key steps:

  1. Open a connection to the file using nc_open()
  2. Inspect the metadata of the file using the “$” operator to extract elements by name
  3. Use our findings from 2. to craft a request for the actual values in the file with ncvar_get()
Note

You can manually determine things such as variable names, lengths, and dimensions during the manual exploration stage we covered above if you are not very familiar with NetCDF files. However, the following code is a more comprehensive method for learning about the data and can be embedded into an automated workflow.

3.2.1 Steps 1 and 2

Opening a connection to the data is very straightforward:

#step 1, open the data
nc <- nc_open(completed_url)

3.2.1.1 Variables

Once open, we can access variable names as follows:

#step 2, explore the data using the $ operator, if you type this code yourself pause after pressing "$", this should open a tooltip that will list the available options
variable_names <- names(nc$var)

#print the variable names for the following discussion point
variable_names
[1] "TIMESERIES"           "LATITUDE"             "LONGITUDE"           
[4] "NOMINAL_DEPTH"        "CPHL"                 "CPHL_quality_control"
[7] "TURB"                 "TURB_quality_control"

With this information we can already learn a lot about the data, for example:

  • We can see there are time, lat, and long variables.
    • These are particularly important for telling us when and where the logger was measuring water quality
  • We can see there are two variables that measure the actual water quality of the location:
    • CPHL, which is chlorophyll
    • TURB, which is turbidity
  • We can also see that each of the water quality variables has a quality control variable associated with it
    • This will allow us to dynamically filter our water quality data to only keep data of a specific quality. This relates to the data flags we discussed at the start of this document.
  • Finally, there is a depth variable, although of lesser importance for our particular purpose it is still nice to know that we could access the deployment depth of the logger if we needed to.

3.2.1.2 Dimensions

However, variable names are not the only information we are interested in, or that we need to access. We also need to access dimension information.

Usually, NetCDF files are 3D or 4D (emphasis on D = dimension). These would be Latitude, Longitude, Time, and Depth. But in the case of our logger data, these files are actually 1D: Time. I admit, this is slightly confusing because in the variable section above I said there is lat and long information. But essentially there is only 1 lat and 1 long value (because the logger is point data), so for the purposes of “dimensions” they don’t exist.

We can see the name of our dimension as follows:

#extract the name of our dimension as follows
dimension_names <- names(nc$dim)

#print to console
dimension_names
[1] "TIME"

The distinction between “TIME” (from the dimensions information), and “TIMESERIES” (from the variables information) is an important one. If we want to extract the sample time for each water quality observation, we need to use the dimension (TIME), not the variable (TIMESERIES). If our data had additional dimensions (lat, long, depth), we would also need to get information about these from the dimensions section as well.

Note

There are plenty of things to learn using the $ operator, some examples are listed in the code chunk below:

#some other examples of things to be extracted using the $ operator include:
#nc$nvars
#nc$dim
#nc$var$CPHL$name
#nc$var$CPHL$size
#nc$var$LATITUDE$size
#nc$var$LATITUDE$units

3.2.2 Step 3

Ok, now that we have all of the variable and dimension names, what can we do? We can use ncvar_get() plus a variable or dimension name to extract information from the file. Since we are interested in pretty much all of them, I will just write this as a map.

#define a vector of variable names, note that we replace "TIMESERIES" with "TIME". This works because the ncvar_get() function will search through vars and dims for the associated name
vec_of_data_names <- str_replace(variable_names, "TIMESERIES", "TIME")

#map over the vector and extract the data associated with each name. Store the result in a list
target_data <- purrr::map(vec_of_data_names, function(x) ncvar_get(nc, x))

#create second version with only three time steps
dummy_data <- purrr::map(vec_of_data_names, function(x) ncvar_get(nc, x, start = 1, count = 3))

#rename each item in our list
names(target_data) <- vec_of_data_names
names(dummy_data) <- vec_of_data_names

For demonstration purposes I have run a second request with only three timesteps of the data to keep the file small while we explore what we got:

#print data
format(dummy_data)
                          TIME                       LATITUDE 
"27034.98, 27078.11, 27078.12"                    "-17.16215" 
                     LONGITUDE                  NOMINAL_DEPTH 
                     "146.007"                            "5" 
                          CPHL           CPHL_quality_control 
      "0.2546, 0.1206, 0.3484"                      "4, 4, 4" 
                          TURB           TURB_quality_control 
      "2.1504, 2.4832, 2.5856"                      "4, 4, 4" 

For the most part this all seems to make sense: the latitude and longitude seem reasonable (and only have 1 value because this is point data), the chlorophyll and turbidity values seem to be within expected concentrations, and the quality control flags match up with our note earlier. However, time is not looking too good right now… so let’s fix that.

3.3 Translating the Data

Currently, the time values we are looking at aren’t very helpful. If you look closely you can see that the values are increasing, but there are no units or contextual clues to tell us what the numbers mean.

This brings us to the third category of information that we need to access: attributes.

If variables are things like the water quality indicators, and dimensions are things like time, lat, and long, then attributes are things like the units of the variables and dimensions, or their long names, or their valid min and max values, etc.

This means that if we access the attributes of our time dimension, we can see extra information about the time dimension - and hopefully figure out how to translate the numbers into something useful.

#access attributes of a variable using ncatt_get()
time_atts <- ncatt_get(nc, "TIME")

For example, the units of our time dimension are:

#view some of the attributes
time_atts$units
[1] "days since 1950-01-01 00:00:00 UTC"

This gives us a baseline to work against. We can infer that a time value of 0 would equal 1950-01-01 00:00:00 UTC, and thus the numbers we are looking at correspond to some point 27,000-ish days after that origin. We can use the R package lubridate to calculate this:

#extract the current time vals
old_time_vals <- target_data$TIME

#assign an origin value to our "zero", make sure it has the UTC timezone, and contains hms
time_origin <- ymd_hms("1950-01-01 00:00:00", tz = "UTC")

#calculate new values by converting old values to absolute time intervals (purely total seconds), then adding that to our formatted origin
new_time_vals <- time_origin + ddays(old_time_vals)
#note that ddays() is not a typo, it stands for duration (rather than using days as a unit) This is because the vector contains fractional days which dont play nicely with the basic days() function

The translated dates look as follows:

#print new time vals
head(new_time_vals, 3)
[1] "2024-01-07 23:33:25 UTC" "2024-02-20 02:40:15 UTC"
[3] "2024-02-20 02:50:15 UTC"

However, this is still using the UTC time zone. To convert to AEST (the time zone for the GBR, UTC+10) we just need to add 10 hours to all values.

#update values from UTC to AEST
new_time_vals <- new_time_vals + hours(10)

4 Using the Data

With all that out of the way, it’s finally time for us to do something with the data. So here is a simple line plot to start:

#create a dataframe from the time and the chla values
simple_df <- data.frame(Time = new_time_vals, 
                        Chlorophyll = target_data$CPHL)

#create a simple ggplot
ggplot(simple_df, aes(x = Time, y = Chlorophyll)) +
  geom_line() +
  theme_bw()

Aaannd immediately we can see why the data flags are necessary. So let’s add those to our table, and then filter the data:

#create a new dataframe from the time, chla, and flag columns
simple_df <- data.frame(Time = new_time_vals, 
                        Chlorophyll = target_data$CPHL,
                        Flags = target_data$CPHL_quality_control)

#filter the data to keep only flags 1 and 2 ("good" and "probably good")
filtered_df <- simple_df |> 
  filter(Flags %in% c(1,2))

#create a simple ggplot
ggplot(filtered_df, aes(x = Time, y = Chlorophyll)) +
  geom_line() +
  theme_bw()

Much better! Next we need to convert the data from its 10-minute sample intervals to daily mean values:

#create a column that tracks only the date (not the time of day); as_date() drops the hms component
daily_df <- filtered_df |> 
  mutate(Date = as_date(Time))

#group by date and summarise
daily_df <- daily_df |> 
  group_by(Date) |> 
  summarise(Chlorophyll = mean(Chlorophyll, na.rm = T)) #note: we already filtered by flag, so na.rm shouldn't actually be needed if the flags work

#create a simple ggplot
ggplot(daily_df, aes(x = Date, y = Chlorophyll)) +
  geom_line() +
  theme_bw()

Looking Good.

5 Saving the Data

The final step in this document is to cover saving the file. Given that we extracted all of the information from the .nc file into a simple dataframe, it can just be saved using write_csv().

#simply use write csv on the df from before
#write_csv(daily_df, "example_output.csv")

6 Conclusions

This script broadly covers all of the steps that would be required to access, download, and manipulate the water quality data. However, this simplistic method does have the potential to introduce issues:

  • The actual method is fairly complex to understand for such a simple output, and conversely is largely automated with only a few key user inputs. Thus, the effort-to-reward ratio is poor.
  • With this method the user currently saves each year of data and each logger individually. This is annoying, and also increases the rate of human error.
  • When the CSV is saved, there is no acknowledgement of transformations completed. It is up to the user to remember what was done.

The natural conclusion from these three points is that a dashboard is required to bridge the gap. This dashboard will allow users to enter their few inputs, and then the complex code will execute in the background. This also allows for years and loggers to be grouped into a single document, and for transformations to be recorded in the output document.

Our Partners

Please visit our website for a list of HWP Partners.

Icons of all HWP partners  

A work by Adam Shand. Reuse: CC-BY-NC-ND.

to@drytropicshealthywaters.org


This work should be cited as:
Adam Shand, "[document title]", 2025, Healthy Waters Partnership for the Dry Tropics.